This exploratory analysis of America’s favorite pastime takes an in-depth look at the history of Negro League play and the statistics its players generated.
The data we are using is from Baseball Reference, though it was compiled by Seamheads from newspapers and other primary sources of the time. It is important to note that the data set is far from complete, as box scores and record keeping were not as comprehensive in the Negro Leagues as they were for the American and National Leagues. In addition, Negro League teams played many exhibition, barnstorming, and other types of games that are not included in the data set. For this reason, our analysis leans on per-game and per-plate-appearance rates rather than career totals.
Our data covers Negro League pitchers and position players from 1920 to 1948 (the period for which MLB has designated these leagues as major leagues) in the Baseball Reference database, with the aforementioned caveats. We restrict the set to hitters with at least 100 plate appearances (PA) and pitchers who appeared in at least 10 games. The data set only includes the career stats each player put up while in the Negro Leagues (for instance, Jackie Robinson’s statistics with the Dodgers are not included).
The median number of career plate appearances in our data set was 374, and the mean was 672.2. The median and mean career on-base percentage plus slugging percentage (OPS) were both .655 (more on this statistic later). For pitchers, the median career earned run average (ERA) was 4.50, and the mean was 4.74. The median career fielding independent pitching (FIP) was 2.80, and the mean was 2.77.
It may seem tautological, but batters who received more plate appearances and pitchers who pitched more games tended to be better players than those with fewer. You can see this in the tables below: career OPS and ERA were better for players with more plate appearances and games pitched, even though these statistics include late-career declines for some players.
| Career plate appearances | OPS |
|---|---|
| Over 1000 PA | 0.755 |
| Under 1000 PA | 0.630 |

| Career games pitched | ERA |
|---|---|
| Over 100 G | 3.89 |
| Under 100 G | 4.90 |
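The split behind tables like these can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the player rows are hypothetical, the function name `mean_ops_by_pa_group` is made up for the example, and it assumes the group value is a simple mean (the tables do not say whether means or medians were used). A player with exactly 1000 PA is placed in the "under" group here, which is also an assumption.

```python
from statistics import mean

# Hypothetical career lines (name, plate appearances, OPS); not real data.
players = [
    ("Player A", 1500, 0.800),
    ("Player B", 1200, 0.700),
    ("Player C", 400, 0.650),
    ("Player D", 150, 0.600),
]


def mean_ops_by_pa_group(rows, threshold=1000):
    """Split players at a PA threshold and average OPS within each group."""
    groups = {"over": [], "under": []}
    for _name, pa, ops in rows:
        # Assumption: exactly `threshold` PA counts as "under".
        key = "over" if pa > threshold else "under"
        groups[key].append(ops)
    # Skip empty groups so mean() never sees an empty list.
    return {key: round(mean(vals), 3) for key, vals in groups.items() if vals}


summary = mean_ops_by_pa_group(players)
print(summary)
```

The same pattern would apply to the pitching table by swapping plate appearances for games pitched and OPS for ERA.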
One of the greatest baseball managers of all time, Earl Weaver of the Baltimore Orioles, famously said “your most precious possessions on offense are your 27 outs.” On-base percentage (OBP) is a useful statistic because it measures a hitter’s ability to avoid an out. It is superior to earlier statistics like batting average because it factors in a batter’s ability to draw walks.
\(OBP = \frac{H+BB+HBP}{AB+BB+HBP+SF}\)
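As a quick worked example, the formula above translates directly into a small Python function. The stat line is hypothetical and chosen only to show how walks lift OBP above batting average; the function name and the choice to return `None` for an empty denominator are assumptions made for this sketch.

```python
def on_base_percentage(h, bb, hbp, ab, sf):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    denominator = ab + bb + hbp + sf
    if denominator == 0:
        # No qualifying plate appearances, so OBP is undefined.
        return None
    return (h + bb + hbp) / denominator


# Hypothetical season: 150 hits, 60 walks, 5 hit-by-pitches,
# 500 at-bats, 5 sacrifice flies.
obp = on_base_percentage(h=150, bb=60, hbp=5, ab=500, sf=5)
batting_average = 150 / 500

print(f"AVG: {batting_average:.3f}")  # 0.300
print(f"OBP: {obp:.3f}")              # 0.377
```

The hitter's batting average is .300, but the 65 extra times on base via walks and hit-by-pitches push the OBP to .377.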
OBP was made famous by Michael Lewis’s 2003 book Moneyball (later adapted into a movie), which chronicled the 2002 Oakland Athletics and their data-driven quest to build a competitive roster with a much lower payroll than other MLB teams. In a memorable scene from the movie, Brad Pitt, playing general manager Billy Beane, explains to the older, less data-driven scouts why he pursues certain players. Over their objections about superficial factors like weight or off-field issues, Beane simply states, “He gets on base.”
Here is an interactive visual of OBP on the X-axis and slugging percentage (SLG) (essentially a measure of power) on the Y-axis for Negro Leaguers from 1920 to 1948 with at least 100 plate appearances.
We created two radar charts to outline a player’s hitting and pitching capabilities. The offensive chart shows how many singles, doubles, triples, and home runs a player hit per plate appearance (PA) compared to a league-average hitter. The players we chose for the charts are the Hall of Fame pitcher Satchel Paige and the Hall of Fame batter Josh Gibson.
This radar chart shows a pitcher’s strikeouts per nine innings pitched (K/9), walks per nine innings pitched (BB/9), and home runs allowed per nine innings pitched (HR/9). These are considered the stats a pitcher has the most control over (i.e., stats that do not depend on the fielders behind him), so they are useful in analyzing a pitcher. A higher K/9 and a lower BB/9 and HR/9 are better.
The gray triangle indicates the stats of an average Negro League pitcher, and the red triangle indicates the stats of Satchel Paige. As the chart makes clear, Paige was well ahead of the league average in all three categories.
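The three rate stats on the pitching chart share one formula: nine times the event count divided by innings pitched. A minimal sketch follows, using a hypothetical pitcher rather than Paige's actual numbers. It assumes innings are already expressed as a true decimal (so one third of an inning is 0.333..., not the box-score shorthand ".1"), and the helper name `per_nine` is made up for the example.

```python
def per_nine(count, innings_pitched):
    """Scale a counting stat to a per-nine-innings rate."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9 * count / innings_pitched


# Hypothetical career line: 200 innings, 180 strikeouts, 45 walks, 9 home runs.
innings = 200.0
k9 = per_nine(180, innings)   # strikeouts per nine innings
bb9 = per_nine(45, innings)   # walks per nine innings
hr9 = per_nine(9, innings)    # home runs allowed per nine innings

print(f"K/9 = {k9:.2f}, BB/9 = {bb9:.2f}, HR/9 = {hr9:.2f}")
```

Because all three are rates, they can be compared across pitchers with very different workloads, which matters for a data set with incomplete game records.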
The radar chart above is shaped like a diamond, echoing a baseball field, to make these statistics a bit easier to read. Josh Gibson was a catcher in the Negro Leagues who is considered one of baseball’s most powerful hitters of all time; he was also the second Negro Leaguer to be inducted into the National Baseball Hall of Fame. This radar chart is an overview of his offensive prowess and hitting capability. The smaller gray diamond, pointing towards ‘singles’, shows the average Negro Leaguer’s statistics in these areas. As you can see, Josh Gibson hit far more home runs per plate appearance than the average Negro Leagues player. Mapping each type of hit to a corner of the familiar diamond lets a general reader see at a glance where a hitter stands out.
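The per-plate-appearance rates plotted on a chart like this can be derived from standard counting stats, with singles recovered as hits minus extra-base hits. The sketch below uses a hypothetical career line, not Gibson's actual numbers, and the function name `hit_rates_per_pa` is an assumption for illustration.

```python
def hit_rates_per_pa(pa, hits, doubles, triples, home_runs):
    """Return singles, doubles, triples, and home runs per plate appearance."""
    if pa <= 0:
        raise ValueError("pa must be positive")
    # Singles are not usually recorded directly: H - 2B - 3B - HR.
    singles = hits - doubles - triples - home_runs
    return {
        "singles": singles / pa,
        "doubles": doubles / pa,
        "triples": triples / pa,
        "home_runs": home_runs / pa,
    }


# Hypothetical slugger: 2000 PA, 600 hits, 100 doubles, 30 triples, 150 HR.
rates = hit_rates_per_pa(pa=2000, hits=600, doubles=100, triples=30, home_runs=150)
for hit_type, rate in rates.items():
    print(f"{hit_type}: {rate:.3f}")
```

Running the same function on league-wide totals would give the baseline shape to compare against.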
Finally, we created a bar chart that summarizes career home run totals for Negro League players from 1920 to 1948. Because most players hit at least a few home runs, the chart is filtered to players who hit more than 70 in their career so that it stays readable. The bars are ordered so that the player with the most home runs is at the top, and the bar colors were chosen to be appealing to the reader.
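The filter-and-sort step behind a chart like this is small enough to sketch. The player names and totals below are hypothetical placeholders, and `top_home_run_hitters` is a made-up helper; the only details taken from the text are the cutoff of more than 70 career home runs and the descending order.

```python
# Hypothetical (name, career home runs) pairs; not real data.
careers = [
    ("Player A", 120),
    ("Player B", 45),
    ("Player C", 85),
    ("Player D", 70),
]


def top_home_run_hitters(rows, min_hr=70):
    """Keep players with more than `min_hr` home runs, most first."""
    kept = [(name, hr) for name, hr in rows if hr > min_hr]
    return sorted(kept, key=lambda row: row[1], reverse=True)


leaders = top_home_run_hitters(careers)
for name, hr in leaders:
    # A crude text bar: one '#' per 10 home runs.
    print(f"{name:<10} {'#' * (hr // 10)} {hr}")
```

Note that a player with exactly 70 home runs is excluded, matching the "more than 70" wording.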